VAST Challenge 2021
Extracted from VAST Challenge 2021 here
In the roughly twenty years that Tethys-based GAStech has been operating a natural gas production site in the island country of Kronos, it has produced remarkable profits and developed strong relationships with the government of Kronos. However, GAStech has not been as successful in demonstrating environmental stewardship.
In January, 2014, the leaders of GAStech are celebrating their new-found fortune as a result of the initial public offering of their very successful company. In the midst of this celebration, several employees of GAStech go missing. An organization known as the Protectors of Kronos (POK) is suspected in the disappearance, but things may not be what they seem.
Extracted from VAST Challenge 2021 Mini Challenge 2 here.
Many of the Abila, Kronos-based employees of GAStech have company cars which are approved for both personal and business use. Those who do not have company cars have the ability to check out company trucks for business use, but these trucks cannot be used for personal business.
Employees with company cars are happy to have these vehicles, because the company cars are generally much higher quality than the cars they would be able to afford otherwise. However, GAStech does not trust their employees. Without the employees’ knowledge, GAStech has installed geospatial tracking software in the company vehicles. The vehicles are tracked periodically as long as they are moving.
This vehicle tracking data has been made available to law enforcement to support their investigation. Unfortunately, data is not available for the day the GAStech employees went missing. Data is only available for the two weeks prior to the disappearance.
To promote local businesses, Kronos based companies provide a Kronos Kares benefit card to GASTech employees giving them discounts and rewards in exchange for collecting information about their credit card purchases and preferences as recorded on loyalty cards. This data has been made available to investigators in the hopes that it can help resolve the situation. However, Kronos Kares does not collect personal information beyond purchases.
The objective of this assignment is to assist law enforcement authorities to ascertain varying purchases made by specific GASTech employees and to identify suspicious patterns of behaviour.
There are a total of 3 csv files provided for MC 2. They are:
There were a few issues that was needed to be resolved in both csv files.
cc_data <- read_csv("data/cc_data.csv")
loyalty_data <- read_csv("data/loyalty_data.csv")
employee_data <- read_csv("data/car-assignments.csv")
#convert timestamp from character into date/time format
cc_data$timestamp <- date_time_parse(cc_data$timestamp,
zone = "",
format = "%m/%d/%Y %H:%M")
#convert timestamp from character into date/time format
loyalty_data$timestamp <- date_time_parse(loyalty_data$timestamp,
zone = "",
format = "%m/%d/%Y")
#Amend string text for Katrina's Cafe
cc_data <- cc_data %>%
mutate(location = str_replace_all(location,
pattern = "Katerin.+",
replacement = "Katrina\x27s Caf\xe9"))
loyalty_data <- loyalty_data %>%
mutate(location = str_replace_all(location,
pattern = "Katerin.+",
replacement = "Katrina\x27s Caf\xe9"))
heatmap_cc <- cc_data %>%
mutate(time60 = round_date(cc_data$timestamp, "60 minutes"),
daydate = weekdays(timestamp),
tempdate = timestamp + 8*60*60,
weekend = chron::is.weekend(tempdate),
time = format(time60, format = "%H:%M")) %>%
select(-c(tempdate)) %>%
group_by(location, daydate, time) %>%
add_count(location, daydate, time, name = "count")
heatmap_cc_weekday <- heatmap_cc %>%
filter(weekend == FALSE)
pop_heatmap_cc_weekday <- heatmap_cc %>%
filter(location == "Abila Zacharo"|
location =="Brew've Been Served" |
location == "Gelatogalore" |
location == "Guy's Gyros" |
location == "Hallowed Grounds" |
location == "Hippokampos" |
location == "Katrina's Café" |
location =="Ouzeri Elian")
heatmap_cc_weekend <- heatmap_cc %>%
filter(weekend == TRUE)
pop_heatmap_cc_weekend <- heatmap_cc %>%
filter(location == "Kalami Kafenion"|
location == "Guy's Gyros" |
location == "Hippokampos" |
location == "Katrina's Café" |
location =="Ouzeri Elian")
x1 <- length(unique(heatmap_cc_weekday$count))
cc_colours1 <- colorRampPalette(c('green', 'yellow', 'orange', 'red'))(x1)
p1 <- ggplot(heatmap_cc_weekday,
aes(location, time)) +
geom_tile(aes(fill = factor(count))) +
scale_fill_manual(values = cc_colours1,
name = "Frequency") +
#breaks = levels(count)[seq(1, x, by = 5)]) +
labs(x = "Locations", y = "Time (Static)", title = "Number of CC Transactions during Weekdays") +
theme(axis.text.x = element_text(size = 8, angle = 45, vjust = 1.1, hjust = 1.1),
axis.text.y = element_text(size = 7),
plot.title = element_text(hjust = 0.5))
x1a <- length(unique(pop_heatmap_cc_weekday$count))
cc_colours1a <- colorRampPalette(c('green', 'yellow', 'orange', 'red'))(x1a)
p1a <- ggplot(pop_heatmap_cc_weekday,
aes(location, time)) +
geom_tile(aes(fill = factor(count))) +
scale_fill_manual(values = cc_colours1a,
name = "Frequency") +
#breaks = levels(count)[seq(1, x, by = 5)]) +
labs(x = "Locations", y = "Time (Static)", title = "Number of CC Transactions during Weekdays") +
theme(axis.text.x = element_text(size = 8, angle = 45, vjust = 1.1, hjust = 1.1),
axis.text.y = element_text(size = 7),
plot.title = element_text(hjust = 0.5))
#svl <- "CC Time Weekday Heatmap.png"
#ggsave(svl)
x2 <- length(unique(heatmap_cc_weekend$count))
cc_colours2 <- colorRampPalette(c('green', 'yellow', 'orange', 'red'))(x2)
p2 <- ggplot(heatmap_cc_weekend,
aes(location, time)) +
geom_tile(aes(fill = factor(count))) +
scale_fill_manual(values = cc_colours2,
name = "Frequency") +
#breaks = levels(count)[seq(1, x, by = 5)]) +
labs(x = "Locations", y = "Time (Static)", title = "Number of CC Transactions during Weekends") +
theme(axis.text.x = element_text(size = 8, angle = 45, vjust = 1.1, hjust = 1.1),
axis.text.y = element_text(size = 7),
plot.title = element_text(hjust = 0.5))
x2a <- length(unique(pop_heatmap_cc_weekend$count))
cc_colours2a <- colorRampPalette(c('green', 'yellow', 'orange', 'red'))(x2a)
p2a <- ggplot(pop_heatmap_cc_weekend,
aes(location, time)) +
geom_tile(aes(fill = factor(count))) +
scale_fill_manual(values = cc_colours2a,
name = "Frequency") +
#breaks = levels(count)[seq(1, x, by = 5)]) +
labs(x = "Locations", y = "Time (Static)", title = "Number of CC Transactions during Weekends") +
theme(axis.text.x = element_text(size = 8, angle = 45, vjust = 1.1, hjust = 1.1),
axis.text.y = element_text(size = 7),
plot.title = element_text(hjust = 0.5))
#svl <- "CC Time Weekend Heatmap.png"
#ggsave(svl)
###################################################################
#By Date
###################################################################
heatmap_cc <- cc_data %>%
mutate(time60 = round_date(cc_data$timestamp, "60 minutes"),
daydate = weekdays(timestamp),
tempdate = timestamp + 8*60*60,
weekend = chron::is.weekend(tempdate),
time = format(time60, format = "%H:%M"),
date = format(timestamp, format = "%m/%d/%Y")) %>%
select(-c(tempdate)) %>%
group_by(location, date) %>%
add_count(location, date, name = "count")
heatmap_cc_weekday <- heatmap_cc %>%
filter(weekend == FALSE)
heatmap_cc_weekend <- heatmap_cc %>%
filter(weekend == TRUE)
x1 <- length(unique(heatmap_cc_weekday$count))
cc_colours1 <- colorRampPalette(c('green', 'yellow', 'orange', 'red'))(x1)
p3 <- ggplot(heatmap_cc_weekday,
aes(location, date)) +
geom_tile(aes(fill = factor(count))) +
scale_fill_manual(values = cc_colours1,
name = "Frequency") +
#breaks = levels(count)[seq(1, x, by = 5)]) +
labs(x = "Locations", y = "Date", title = "Number of CC Transactions during Weekdays") +
theme(axis.text.x = element_text(size = 8, angle = 45, vjust = 1.1, hjust = 1.1),
axis.text.y = element_text(size = 7),
plot.title = element_text(hjust = 0.5))
#svl <- "CC Weekday Heatmap.png"
#ggsave(svl)
x2 <- length(unique(heatmap_cc_weekend$count))
cc_colours2 <- colorRampPalette(c('green', 'yellow', 'orange', 'red'))(x2)
p4 <- ggplot(heatmap_cc_weekend,
aes(location, date)) +
geom_tile(aes(fill = factor(count))) +
scale_fill_manual(values = cc_colours2,
name = "Frequency") +
#breaks = levels(count)[seq(1, x, by = 5)]) +
labs(x = "Locations", y = "Date", title = "Number of CC Transactions during Weekends") +
theme(axis.text.x = element_text(size = 8, angle = 45, vjust = 1.1, hjust = 1.1),
axis.text.y = element_text(size = 7),
plot.title = element_text(hjust = 0.5))
#svl <- "CC Weekend Heatmap.png"
#ggsave(svl)
heatmap_loy <- loyalty_data %>%
mutate(daydate = weekdays(timestamp),
tempdate = timestamp + 8*60*60,
weekend = chron::is.weekend(tempdate),
date = format(timestamp, format = "%m/%d/%Y")) %>%
select(-c(tempdate)) %>%
group_by(location, date) %>%
add_count(location, date, name = "count")
heatmap_loy_weekday <- heatmap_loy %>%
filter(weekend == FALSE)
heatmap_loy_weekend <- heatmap_loy %>%
filter(weekend == TRUE)
x1 <- length(unique(heatmap_loy_weekday$count))
cc_colours1 <- colorRampPalette(c('green', 'yellow', 'orange', 'red'))(x1)
p5 <- ggplot(heatmap_loy_weekday,
aes(location, date)) +
geom_tile(aes(fill = factor(count))) +
scale_fill_manual(values = cc_colours1,
name = "Frequency") +
#breaks = levels(count)[seq(1, x, by = 5)]) +
labs(x = "Locations", y = "Date", title = "Loyalty Card Transactions during Weekdays") +
theme(axis.text.x = element_text(size = 8, angle = 45, vjust = 1.1, hjust = 1.1),
axis.text.y = element_text(size = 7),
plot.title = element_text(hjust = 0.5))
#svl <- "Loyalty Weekday Heatmap.png"
#ggsave(svl)
x2 <- length(unique(heatmap_loy_weekend$count))
cc_colours2 <- colorRampPalette(c('green', 'yellow', 'orange', 'red'))(x2)
p6 <- ggplot(heatmap_loy_weekend,
aes(location, date)) +
geom_tile(aes(fill = factor(count))) +
scale_fill_manual(values = cc_colours2,
name = "Frequency") +
#breaks = levels(count)[seq(1, x, by = 5)]) +
labs(x = "Locations", y = "Date", title = "Loyalty Card Transactions during Weekends") +
theme(axis.text.x = element_text(size = 8, angle = 45, vjust = 1.1, hjust = 1.1),
axis.text.y = element_text(size = 7),
plot.title = element_text(hjust = 0.5))
#svl <- "Loyalty Weekend Heatmap.png"
#ggsave(svl)


From the heat map visualisation, it can be observed that the following locations are largely popular during weekdays.
Similar observations were seen on both set of data in credit card and loyalty card transactions.


During weekends, the the popular locations are:
Similarly, observations were consistent on both set of data in credit card and loyalty card transactions.
To discover the popular periods for the locations, we could only use credit card transactions, since the timestap has a time element to it. This is what we discovered:


It was observed that a number of locations do not encounter a relatively higher peak throughout the day, except for the following locations:
During Breakfast (0730hrs to 0830hrs):
During Dinner (1930hrs to 2030hrs)
During Lunch (1330hrs to 1430hrs)
During Dinner (1930hrs to 2030hrs)
Unlike Guy Gyros, Jippolampos and Katrina’s Cafe have longer periods of higher transaction volume.
It was noticed that there are 34 distinct locations in the credit card transactions and 33 distinct locations in the loyalty card transactions. A quick comparison was made, and it was noticed that Daily Dealz was not inside the loyalty card transaction. An inference could be that the location does not provide any benefits for the usage of GASTech loyalty card.
txn_cc_plot <- ggplot(cc_data, aes(location, price)) +
labs(x = "Locations", y = "Price", title = "Transaction Amount by Credit Card") +
geom_boxplot_interactive(aes(tooltip = price),
stackgroups = TRUE) +
theme(axis.text.x = element_text(size = 8, angle = 45, vjust = 1.1, hjust = 1.1),
axis.text.y = element_text(size = 7),
plot.title = element_text(hjust = 0.5))
txn_loy_plot <- ggplot(loyalty_data, aes(location, price)) +
labs(x = "Locations", y = "Price", title = "Transaction Amount by Loyalty Card") +
geom_boxplot_interactive(aes(tooltip = price),
stackgroups = TRUE) +
theme(axis.text.x = element_text(size = 8, angle = 45, vjust = 1.1, hjust = 1.1),
axis.text.y = element_text(size = 7),
plot.title = element_text(hjust = 0.5))
less_txn_cc_plot <- cc_data %>%
filter(location != "Abila Airport" &
location != "Abila Scrapyard" &
location != "Albert's Fine Clothing" &
location != "Carlyle Chemical Inc." &
location != "Frydos Autosupply n' More" &
location != "Kronos Pipe and Irrigation" &
location != "Maximum Iron and Steel" &
location != "Nationwide Refinery" &
location != "Stewart and Sons Fabrication")
less_txn_cc_plot <- ggplot(less_txn_cc_plot, aes(location, price)) +
labs(x = "Locations", y = "Price", title = "Transaction Amount by Credit Card") +
geom_boxplot_interactive(aes(tooltip = price),
stackgroups = TRUE) +
theme(axis.text.x = element_text(size = 8, angle = 45, vjust = 1.1, hjust = 1.1),
axis.text.y = element_text(size = 7),
plot.title = element_text(hjust = 0.5))
less_txn_loy_plot <- loyalty_data %>%
filter(location != "Abila Airport" &
location != "Abila Scrapyard" &
location != "Albert's Fine Clothing" &
location != "Carlyle Chemical Inc." &
location != "Frydos Autosupply n' More" &
location != "Kronos Pipe and Irrigation" &
location != "Maximum Iron and Steel" &
location != "Nationwide Refinery" &
location != "Stewart and Sons Fabrication")
less_txn_loy_plot <- ggplot(less_txn_loy_plot, aes(location, price)) +
labs(x = "Locations", y = "Price", title = "Transaction Amount by Credit Card") +
geom_boxplot_interactive(aes(tooltip = price),
stackgroups = TRUE) +
theme(axis.text.x = element_text(size = 8, angle = 45, vjust = 1.1, hjust = 1.1),
axis.text.y = element_text(size = 7),
plot.title = element_text(hjust = 0.5))
There are a series of transaction amount suspicious anomalies.
We will attempt to filter out the locations with high-paying transactions, to further observe if there are any more suspicious anomalies.
Another two transactions of $600 from Chostus Hotel and $477.60 from General Grocer could be suspicious anomalies.
The details of the suspicious transactions are as follows:
| timestamp | location | price | last4ccnum |
|---|---|---|---|
| 2014-01-13 19:20:00 | Frydos Autosupply n’ More | 10000.00 | 9551 |
| 2014-01-16 11:25:00 | Nationwide Refinery | 4742.67 | 7792 |
| 2014-01-08 09:54:00 | Nationwide Refinery | 4513.16 | 9152 |
| 2014-01-14 16:28:00 | Nationwide Refinery | 4429.76 | 2276 |
| 2014-01-14 14:13:00 | Abila Scrapyard | 4277.40 | 2276 |
| 2014-01-17 19:44:00 | Albert’s Fine Clothing | 1239.41 | 1321 |
| 2014-01-18 12:03:00 | Chostus Hotel | 600.00 | 5010 |
| timestamp | location | price | loyaltynum |
|---|---|---|---|
| 2014-01-18 | General Grocer | 477.6 | L9362 |
#Preparing the map
bgmap <- raster("data/Geospatial/MC2-tourist.tif")
Abila_st <- st_read(dsn = "data/Geospatial",
layer = 'Abila')
Reading layer `Abila' from data source
`D:\MadwolfDT\DataViz_Blog\_posts\2021-07-13-assignment-1\data\Geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 3290 features and 9 fields
Geometry type: LINESTRING
Dimension: XY
Bounding box: xmin: 24.82401 ymin: 36.04502 xmax: 24.90997 ymax: 36.09492
Geodetic CRS: WGS 84
gps <- read_csv("data/gps.csv")
gps$Timestamp <- date_time_parse(gps$Timestamp,
zone = "",
format = "%m/%d/%Y %H:%M:%S")
x <- gps %>%
mutate(datestamp = as.Date(Timestamp + 60*60*8))
gps <- x
##convert
gps_sf <- st_as_sf(gps,
coords = c("long", "lat"),
crs = 4326)
##string to gps path
gps_path <- gps_sf %>%
group_by(id) %>%
summarize(m = mean(Timestamp),
do_union = FALSE) %>%
st_cast("LINESTRING")
#Discover top locations recorded
locations_gps <- gps
#Discard the 5th decimal place and have an accuracy of 11.1m
locations_gps$lat <- round(locations_gps$lat, digits = 4)
locations_gps$long <- round(locations_gps$long, digits = 4)
In order to analyse the GPS data more efficiently and effectively, there is a need to determine possible Places of Interests (POIs) in Abila, Kronos. To do so, we can break them down into the following categories:
It was mentioned that GASTech company vehicles are installed with GPS and “the vehicles are tracked periodically as long as they are moving”. With that, as long as the vehicles are in stationary, the GPS would not be tracking the movement. With that, we could analyse the gps data and determine common Places of Interests (POI) by finding our the time lag between each data, grouped by the ID, and analyse all the lat/long coordinates that has a time lag of more than 3 minutes. Since we are analysing for POIs, the accuracy could be in the range of 11.1m, thus, we would only use lat/long up to 4 decimal points.
records_POI <- locations_gps %>%
mutate(datestamp = as.Date(Timestamp + 60*60*8)) %>%
group_by(id) %>%
mutate(stop = Timestamp - lag(Timestamp)) %>%
mutate(parked = ifelse(stop >60*3, TRUE,FALSE)) %>%
mutate(lat111 = trunc(lat*1000)/1000,
long111 = trunc(long*1000)/1000) %>%
ungroup() %>%
filter(parked == TRUE) %>%
group_by(id, datestamp) %>%
add_count(id, datestamp, name = "visitcount") %>%
ungroup() %>%
rename(timestamp = Timestamp)
records_POI <- records_POI[c(1,5,2,3,4,8,9,6,7,10)]
d_records_POI <- records_POI %>%
distinct(lat,long, .keep_all = TRUE)
d_POI_tif_sf <- st_as_sf(d_records_POI,
coords = c("long", "lat"),
crs = 4326) %>%
st_cast("POINT") #%>%
tmBase <- tm_shape(bgmap) +
tm_rgb(bgmap, r = 1, g = 2, b = 3,
alpha = 0.5,
saturation = 1,
interpolate = TRUE,
max.value = 255)
tmrd_records_POI <- tmBase +
tm_shape(d_POI_tif_sf) +
tm_dots(size = 0.3,
alpha = 1,
col = "black")

It was observed that there are 265 identified POIs in Abila. As mentioned, the POI may include homes and office. Therefore, we would attempt to segregate out the homes from these POIs. We will filter out locations, for those that had remained parked for 8hours or more.
home_POI <- records_POI %>%
filter(stop >= 60*60*8) %>%
distinct(id, lat, long, .keep_all = TRUE) %>%
group_by(id) %>%
arrange(id, lat, long) %>%
mutate(near_lat = lat - lag(lat),
near_long = long - lag(long),
lat111 = trunc(lat*1000)/1000,
long111 = trunc(long*1000)/1000) %>%
mutate(drop = ifelse(is.na(near_lat),FALSE,
ifelse(between(near_lat, -0.0001,0.0001),
ifelse(between(near_long, -0.0001,0.0001), TRUE, FALSE),
ifelse(between(near_long, -0.0001,0.0001),TRUE, FALSE))))
############################################
#Aggregating ID = 28
############################################
x <- home_POI %>%
filter(id != 28)
y <- home_POI %>%
filter(id==28)
y$drop <- TRUE
y$drop <- ifelse(y$lat == 36.0732 & y$long == 24.8759,FALSE, y$drop)
y$drop <- ifelse(y$lat == 36.0524 & y$long == 24.8761,FALSE, y$drop)
x <- rbind(x, y)
home_POI <- x
##############################################
d_home_POI <- home_POI %>%
filter(drop == FALSE)
d_home_POI_tif_sf <- st_as_sf(d_home_POI,
coords = c("long", "lat"),
crs = 4326) %>%
select(-c(datestamp, lat111, long111,stop, parked, visitcount, near_lat, near_long,drop)) %>%
st_cast("POINT") #%>%
x <- left_join(d_home_POI_tif_sf, d_home_POI, by = c("timestamp", "id")) %>%
select(-c(datestamp, lat111, long111,stop, parked, visitcount, near_lat, near_long,drop))
d_home_POI_tif_sf <- x
tmBase_i <- tm_shape(bgmap) +
tm_rgb(bgmap, r = 1, g = 2, b = 3,
alpha = 0.5,
saturation = 1,
interpolate = TRUE,
max.value = 255)
tm_home_POI_i <- tmBase +
tm_shape(d_home_POI_tif_sf) +
tm_dots(size = 0.05,
alpha = 1,
col = "blue")
tm_home_POI <- tmBase +
tm_shape(d_home_POI_tif_sf) +
tm_dots(size = 0.3,
alpha = 1,
col = "blue")
There are a total of 53 distinct locations which had GPS-marked vehicles parked for more than 8 hours. They can be a mixture of offices or homes. In addition, it was observed that Isande Borrasca, a Drill Technician with GASTech had numerous locations marked for staying for more than 8hours. That is because the GPS receiver inside his car is likely to have technical issues due to erractic location marking. However, since the general route could be observed, GPS tampering has been ruled out. As such, the 2 distinct locations that are identifed with Isande Borrasca are aggregated to the following:
gps_path_28 <- gps_sf %>%
filter(id==28) %>%
summarize(m = mean(Timestamp),
do_union = FALSE) %>%
st_cast("LINESTRING")
tmap_mode("plot")
x <- tmBase +
tm_shape(gps_path_28) +
tm_lines(col = "blue")
x

d_home_or_office_POI <- d_home_POI %>%
group_by(id) %>%
add_count(id, name = "count") %>%
ungroup() %>%
filter(count >= 2) %>%
select(-c(count))
d_cfm_home_POI <- d_home_POI %>%
group_by(id) %>%
add_count(id, name = "count") %>%
ungroup() %>%
mutate(category = "Home") %>%
filter(count ==1 ) %>%
filter(id < 100) %>% #remove trucks
select(-c(count))
master_POI_list <- d_cfm_home_POI %>%
select(-c(timestamp, datestamp, lat111, long111, parked, drop, visitcount, near_lat, near_long, stop))
x <- rbind(master_POI_list, c("GASTech", 36.0480, 24.8796, "Office"))
x$lat <- as.double(x$lat)
x$long <- as.double(x$long)
master_POI_list <- x
d_cfm_home_POI$category <- ifelse(d_cfm_home_POI$id > 100, "Office", d_cfm_home_POI$category)
distinct_home_or_office <- d_home_or_office_POI %>%
group_by(lat, long) %>%
add_count(lat, long, name = "count") %>%
ungroup()
add_1 <- d_home_or_office_POI %>%
filter(id == 1) %>%
mutate(category = "Home")
add_1 <- add_1[-1,]
add_5 <- d_home_or_office_POI %>%
filter(id == 5) %>%
mutate(category = "Home")
add_5 <- add_5[-1,]
add_6 <- d_home_or_office_POI %>%
filter(id == 6) %>%
mutate(category = "Home")
add_6 <- add_6[-1,]
add_17 <- d_home_or_office_POI %>%
filter(id == 17) %>%
mutate(category = "Home")
add_17 <- add_17[-1,]
add_22 <- d_home_or_office_POI %>% #share residence with 30?
filter(id == 22) %>%
mutate(category = "Home")
add_22 <- add_22[-1,]
add_30 <- d_home_or_office_POI %>% #share residence with 22?
filter(id == 30) %>%
mutate(category = "Home")
add_30 <- add_30[-1,]
add_25 <- d_home_or_office_POI %>% #2 residences?
filter(id == 25) %>%
mutate(category = "Unknown")
add_9 <- d_home_or_office_POI %>% #5 places?
filter(id == 9) %>%
mutate(category = "Unknown")
add_21 <- d_home_or_office_POI %>% #2 places?
filter(id == 21) %>%
mutate(category = "Unknown")
add_28 <- d_home_or_office_POI %>% #2 places?
filter(id == 28) %>%
mutate(category = "Unknown")
x <- rbind(d_cfm_home_POI, add_1, add_17, add_21, add_22, add_25, add_28, add_30, add_5, add_6, add_9)
rm(add_1, add_17, add_21, add_22, add_25, add_28, add_30, add_5, add_6, add_9)
d_cfm_home_POI <- x
d_cfm_home_POI_tif_sf <- st_as_sf(d_cfm_home_POI,
coords = c("long", "lat"),
crs = 4326) %>%
select(-c(datestamp, lat111, long111,stop, parked, visitcount, near_lat, near_long, drop)) %>%
st_cast("POINT") #%>%
x <- left_join(d_cfm_home_POI_tif_sf, d_home_POI, by = c("timestamp", "id")) %>%
select(-c(datestamp, lat111, long111,stop, parked, visitcount, near_lat, near_long,drop, timestamp))
x <- left_join(x, employee_data,
by = c("id" = "CarID"))
x <- select(x, -c("CurrentEmploymentType"))
x <- x[c(1,8,7,6,4,5,2,3)]
d_cfm_home_POI_tif_sf <- x
tm_d_cfm_home_POI_i <- tm_shape(d_cfm_home_POI_tif_sf) +
tm_dots(size = 0.05,
col = "category",
palette = c("red", "blue", "green"))
tm_d_cfm_home_POI <- tm_shape(d_cfm_home_POI_tif_sf) +
tm_dots(size = 0.3,
alpha = 1,
col = "category",
palette = c("red", "blue", "green"))
tm_d_cfm_home_POI_labels <- tm_shape(d_cfm_home_POI_tif_sf) +
tm_dots(size = 0.1,
alpha = 0,
col = "blue")
print_points_i <- function(df.x, start_time, end_time, dot_colour){
st1 <- date_time_parse("2014-01-13 10:00",
zone = "",
format = "%Y-%m-%d %H:%M")
et1 <- date_time_parse("2014-01-13 15:00",
zone = "",
format = "%Y-%m-%d %H:%M")
x <- records_POI %>%
filter(timestamp >= st1 & timestamp <=et1)
x_tif_sf <- st_as_sf(x,
coords = c("long", "lat"),
crs = 4326) %>%
st_cast("POINT") %>%
select(-c(datestamp, lat111, long111, stop, parked, visitcount))
tm.x <- tmBase_i +
tm_shape(x_tif_sf) +
tm_dots(size = 0.05,
alpha = 1,
col = dot_colour)
}
save_emp_routes <- function(emp_id, query_date) {
emp_home_point <- d_cfm_home_POI %>%
filter(id == emp_id) %>%
select(lat, long)
emp_office_point <- d_cfm_home_POI %>%
filter(category == "Office") %>%
distinct(lat, long)
emp_home_office_points <- rbind(emp_home_point, emp_office_point)
emp_POIs <- records_POI %>%
filter(id == emp_id) %>%
filter(datestamp == query_date) %>%
select(-c(visitcount)) %>%
group_by(lat, long) %>%
add_count(lat, long, name = "Number of Visits")
emp_home_office_tif_sf <- st_as_sf(emp_home_office_points,
coords = c("long", "lat"),
crs = 4326) %>%
st_cast("POINT")
emp_POIs_tif_sf <- st_as_sf(emp_POIs,
coords = c("long", "lat"),
crs = 4326) %>%
select(-c(datestamp, lat111, long111, stop, parked, id)) %>%
arrange(timestamp) %>%
mutate(sequence = 1:n()) %>%
st_cast("POINT")
emp_POIs_tif_sf <- emp_POIs_tif_sf[c(3,1,2,4)]
query_path <- gps %>%
filter(id == emp_id) %>%
filter(datestamp == query_date)
#convert to coordinates
query_path_sf <- st_as_sf(query_path,
coords = c("long", "lat"),
crs = 4326)
#string to gps path
query_gps_path <- query_path_sf %>%
summarize(m = mean(Timestamp),
do_union = FALSE) %>%
st_cast("LINESTRING")
tmemp_home_office_i <- tm_shape(emp_home_office_tif_sf) +
tm_dots(size = 0.08,
alpha = 1,
col = "green")
tmemp_home_office <- tm_shape(emp_home_office_tif_sf) +
tm_dots(size = 0.8,
alpha = 1,
col = "green")
tmemp_POIs_i <- tm_shape(emp_POIs_tif_sf) +
tm_dots(size = 0.05,
alpha = 1,
col = "blue")
tmemp_POIs <- tm_shape(emp_POIs_tif_sf) +
tm_dots(size = 0.5,
alpha = 1,
col = "blue")
tmemp_gps_path_i <- tm_shape(query_gps_path) +
tm_lines()
#prints non-interactive size
tmOverall <- tmBase + tmemp_gps_path_i + tmemp_home_office + tmemp_POIs
}
save_emp_routes_i <- function(emp_id, query_date) {
emp_home_point <- d_cfm_home_POI %>%
filter(id == emp_id) %>%
select(lat, long)
emp_office_point <- d_cfm_home_POI %>%
filter(category == "Office") %>%
distinct(lat, long)
emp_home_office_points <- rbind(emp_home_point, emp_office_point)
emp_POIs <- records_POI %>%
filter(id == emp_id) %>%
filter(datestamp == query_date) %>%
select(-c(visitcount)) %>%
group_by(lat, long) %>%
add_count(lat, long, name = "Number of Visits")
emp_home_office_tif_sf <- st_as_sf(emp_home_office_points,
coords = c("long", "lat"),
crs = 4326) %>%
st_cast("POINT")
emp_POIs_tif_sf <- st_as_sf(emp_POIs,
coords = c("long", "lat"),
crs = 4326) %>%
select(-c(datestamp, lat111, long111, stop, parked, id)) %>%
arrange(timestamp) %>%
mutate(sequence = 1:n()) %>%
st_cast("POINT")
emp_POIs_tif_sf <- emp_POIs_tif_sf[c(3,1,2,4)]
query_path <- gps %>%
filter(id == emp_id) %>%
filter(datestamp == query_date)
#convert to coordinates
query_path_sf <- st_as_sf(query_path,
coords = c("long", "lat"),
crs = 4326)
#string to gps path
query_gps_path <- query_path_sf %>%
summarize(m = mean(Timestamp),
do_union = FALSE) %>%
st_cast("LINESTRING")
tmemp_home_office_i <- tm_shape(emp_home_office_tif_sf) +
tm_dots(size = 0.08,
alpha = 1,
col = "green")
tmemp_home_office <- tm_shape(emp_home_office_tif_sf) +
tm_dots(size = 0.8,
alpha = 1,
col = "green")
tmemp_POIs_i <- tm_shape(emp_POIs_tif_sf) +
tm_dots(size = 0.05,
alpha = 1,
col = "blue")
tmemp_POIs <- tm_shape(emp_POIs_tif_sf) +
tm_dots(size = 0.5,
alpha = 1,
col = "blue")
tmemp_gps_path_i <- tm_shape(query_gps_path) +
tm_lines()
#prints non-interactive size
tmOverall <- tmBase + tmemp_gps_path_i + tmemp_home_office_i + tmemp_POIs_i
}
#getting ALL the ids with positive home categorisation
emp_with_cfm_home <- d_cfm_home_POI %>%
filter(category == "Home") %>%
select(id)
#n <- nrow(emp_with_cfm_home)
#
#for(i in 1:n){
#
# print_map <- save_emp_routes(emp_with_cfm_home$id[i], "2014-01-06")
#
# #tmap_save(print_map, paste("06 Jan Route of ", emp_with_cfm_home$id[i], ".png", sep = ""))
#
#}
It was determined that GASTech location is at (36.0480/ 24.8796) because there are high frequency counts by 6 of the employees, suggesting that the following 6 employees have stayed for more than 8 hours in office, at least once, during the past 2 weeks.
tmap_mode("view")
x <- tmBase_i + tm_d_cfm_home_POI_i + tm_d_cfm_home_POI_labels
x
From the map and table, it could be inferred that:
Now that we have determined the homes and office of various employees, we can proceed to investigate on the transaction anomalies.
txn10000_cc <- cc_data %>%
filter(price == 10000)
txn10000_cc %>%
kbl() %>%
kable_styling()
| timestamp | location | price | last4ccnum |
|---|---|---|---|
| 2014-01-13 19:20:00 | Frydos Autosupply n’ More | 10000 | 9551 |
txn10000_loy <- loyalty_data %>%
filter(timestamp == "2014-01-13") %>%
filter(location == "Frydos Autosupply n' More")
txn10000_loy %>%
kbl() %>%
kable_styling()
| timestamp | location | price | loyaltynum |
|---|---|---|---|
| 2014-01-13 | Frydos Autosupply n’ More | 188.57 | L8328 |
| 2014-01-13 | Frydos Autosupply n’ More | 64.60 | L6110 |
| 2014-01-13 | Frydos Autosupply n’ More | 202.05 | L9018 |
| 2014-01-13 | Frydos Autosupply n’ More | 87.57 | L2169 |
txnFrydos <- cc_data %>%
mutate(datestamp = as.Date(timestamp + 60*60*8)) %>%
filter(datestamp == "2014-01-13") %>%
filter(location == "Frydos Autosupply n' More")
txnFrydos %>%
kbl() %>%
kable_styling()
| timestamp | location | price | last4ccnum | datestamp |
|---|---|---|---|---|
| 2014-01-13 19:20:00 | Frydos Autosupply n’ More | 10000.00 | 9551 | 2014-01-13 |
| 2014-01-13 19:41:00 | Frydos Autosupply n’ More | 188.57 | 8129 | 2014-01-13 |
| 2014-01-13 19:59:00 | Frydos Autosupply n’ More | 64.60 | 8411 | 2014-01-13 |
| 2014-01-13 21:11:00 | Frydos Autosupply n’ More | 202.05 | 2418 | 2014-01-13 |
The transaction was transacted by somebody holding onto the card, 9551. And a check with the loyalty card data for all transactions on 13 Jan 2014 at Frydos Autosupply n’ More indicated that there are no transaction bearing $10,000. After crossing checking between credit card transactions and loyalty card transactions, only 2 transactions bear the exact price, suggesting the following:
We would go on to examine the routes for all the vehicles on 13 Jan and none of the route profile matches the spending pattern as suggested by the transactions. However, CarID 24 and 35 bore some resemblences and we would further examine.
all_card9551_txn <- cc_data %>%
filter(last4ccnum == 9551) %>%
mutate(datestamp = as.Date(timestamp+60*60*8)) %>%
filter(price != 10000)
card9551_txn_chart <- ggplot(all_card9551_txn,
aes(datestamp, price)) +
geom_raster(aes(fill = location)) +
labs(x = "Time", y = "Date", title = "Vehicle Moving Plot") +
theme(axis.text.x = element_text(size = 8, angle = 45, vjust = 1.1, hjust = 1.1),
axis.text.y = element_text(size = 6),
plot.title = element_text(hjust = 0.5))
card9551_txn_chart

card9551 <- cc_data %>%
filter(last4ccnum == 9551) %>%
mutate(datestamp = as.Date(timestamp+60*60*8)) %>%
filter(datestamp == "2014-01-13") %>%
select(-c(datestamp))
card9551 %>%
kbl() %>%
kable_styling()
| timestamp | location | price | last4ccnum |
|---|---|---|---|
| 2014-01-13 06:04:00 | Daily Dealz | 2.01 | 9551 |
| 2014-01-13 13:18:00 | U-Pump | 55.25 | 9551 |
| 2014-01-13 13:28:00 | Hippokampos | 30.51 | 9551 |
| 2014-01-13 19:20:00 | Frydos Autosupply n’ More | 10000.00 | 9551 |
| 2014-01-13 19:30:00 | Ouzeri Elian | 28.75 | 9551 |
#for(i in 1:30){
#
# print_map <- save_emp_routes(i, "2014-01-13")
#
# tmap_save(print_map, paste("13 Jan Route of ", i, ".png", sep = ""))
#
#}
#
#for(i in 32:35){
#
# print_map <- save_emp_routes(i, "2014-01-13")
#
# tmap_save(print_map, paste("13 Jan Route of ", i, ".png", sep = ""))
#
#}
#
#print_map <- save_emp_routes(101, "2014-01-13")
#
#tmap_save(print_map, paste("13 Jan Route of ", 101, ".png", sep = ""))
#
#for(i in 107:107){
#
# print_map <- save_emp_routes(i, "2014-01-13")
#
# tmap_save(print_map, paste("13 Jan Route of ", i, ".png", sep = ""))
#
#}
st1 <- date_time_parse("2014-01-13 00:00",
zone = "",
format = "%Y-%m-%d %H:%M")
et1 <- date_time_parse("2014-01-13 15:00",
zone = "",
format = "%Y-%m-%d %H:%M")
x <- records_POI %>%
filter(timestamp >= st1 & timestamp <=et1)
print.x <- print_points_i(x, st1, et1, "red")
tmap_mode("view")
print.x
It was observed that carID 24 was the only vehicle to have made a stop near U-Pump on 13 Jan before the transaction time at 13:18.
card24_1 <- print_map <- save_emp_routes_i(24, "2014-01-13")
card24_2 <- print_map <- save_emp_routes_i(24, "2014-01-14")
tmap_mode("view")
card24_1
card24_2
card24_POIs <- records_POI %>%
filter(id == 24) %>%
filter(datestamp == "2014-01-13"| datestamp == "2014-01-14") %>%
select(-c(visitcount, lat111, long111)) %>%
group_by(datestamp) %>%
arrange(timestamp) %>%
mutate("Sequence" = 1:n()) %>%
ungroup() %>%
select(-c(datestamp, id, stop, parked))
x <- card24_POIs %>%
left_join(select(master_POI_list, category, lat, long),
by = c("lat", "long"))
card24_POIs <- x
card24_POIs %>%
kbl() %>%
kable_styling()
| timestamp | lat | long | Sequence | category |
|---|---|---|---|---|
| 2014-01-13 07:32:01 | 36.0625 | 24.8988 | 1 | NA |
| 2014-01-13 08:07:01 | 36.0541 | 24.9012 | 2 | NA |
| 2014-01-13 11:16:01 | 36.0480 | 24.8796 | 3 | Office |
| 2014-01-13 11:46:01 | 36.0767 | 24.8576 | 4 | NA |
| 2014-01-13 12:31:01 | 36.0767 | 24.8576 | 5 | NA |
| 2014-01-13 13:22:01 | 36.0678 | 24.8715 | 6 | NA |
| 2014-01-13 17:57:01 | 36.0480 | 24.8796 | 7 | Office |
| 2014-01-13 19:29:01 | 36.0549 | 24.9018 | 8 | NA |
| 2014-01-14 03:20:01 | 36.0625 | 24.8989 | 1 | NA |
| 2014-01-14 07:47:01 | 36.0782 | 24.8721 | 2 | Home |
| 2014-01-14 08:18:01 | 36.0734 | 24.8642 | 3 | NA |
| 2014-01-14 11:19:01 | 36.0480 | 24.8796 | 4 | Office |
| 2014-01-14 11:52:01 | 36.0589 | 24.8928 | 5 | NA |
| 2014-01-14 12:17:01 | 36.0589 | 24.8928 | 6 | NA |
| 2014-01-14 14:04:01 | 36.0635 | 24.8510 | 7 | NA |
| 2014-01-14 17:46:01 | 36.0480 | 24.8796 | 8 | Office |
| 2014-01-14 18:54:01 | 36.0625 | 24.8988 | 9 | NA |
| 2014-01-14 19:02:01 | 36.0550 | 24.9018 | 10 | NA |
| 2014-01-14 20:41:01 | 36.0550 | 24.9019 | 11 | NA |
Observations from Vehicle 24.
While carID = 24 is suspicious, we are unable to conclusive link him to card 9551.
card35_1 <- print_map <- save_emp_routes_i(35, "2014-01-13")
card35_2 <- print_map <- save_emp_routes_i(35, "2014-01-14")
tmap_mode("view")
card35_1
card35_2
card35_POIs <- records_POI %>%
filter(id == 35) %>%
filter(datestamp == "2014-01-13"| datestamp == "2014-01-14") %>%
select(-c(visitcount, lat111, long111)) %>%
group_by(datestamp) %>%
arrange(timestamp) %>%
mutate("Sequence" = 1:n()) %>%
ungroup() %>%
select(-c(datestamp, id, stop, parked))
x <- card35_POIs %>%
left_join(select(master_POI_list, category, lat, long),
by = c("lat", "long"))
card35_POIs <- x
card35_POIs %>%
kbl() %>%
kable_styling()
| timestamp | lat | long | Sequence | category |
|---|---|---|---|---|
| 2014-01-13 06:46:01 | 36.0763 | 24.8747 | 1 | NA |
| 2014-01-13 06:59:01 | 36.0675 | 24.8733 | 2 | NA |
| 2014-01-13 12:18:01 | 36.0480 | 24.8796 | 3 | Office |
| 2014-01-13 13:41:01 | 36.0558 | 24.9026 | 4 | NA |
| 2014-01-13 17:54:01 | 36.0480 | 24.8796 | 5 | Office |
| 2014-01-13 19:39:01 | 36.0763 | 24.8747 | 6 | NA |
| 2014-01-13 19:51:01 | 36.0767 | 24.8576 | 7 | NA |
| 2014-01-13 20:36:01 | 36.0767 | 24.8576 | 8 | NA |
| 2014-01-14 06:38:01 | 36.0762 | 24.8747 | 1 | Home |
| 2014-01-14 07:00:01 | 36.0675 | 24.8734 | 2 | NA |
| 2014-01-14 07:09:01 | 36.0675 | 24.8733 | 3 | NA |
| 2014-01-14 12:37:01 | 36.0480 | 24.8796 | 4 | Office |
| 2014-01-14 13:36:01 | 36.0598 | 24.8580 | 5 | NA |
| 2014-01-14 18:01:01 | 36.0480 | 24.8796 | 6 | Office |
| 2014-01-14 19:09:01 | 36.0762 | 24.8747 | 7 | Home |
| 2014-01-14 20:26:01 | 36.0767 | 24.8576 | 8 | NA |
While CarID 35 may have the stops similar to the transactor of $10,000, the timestamp for the stops do not coincede with the transactions. In addition, the routes for the 2 days do not have any anomalies sighted.
Now that we have identified the specific POIs, we can now turn our focus to identify anomalies in the movement of the vehicles.
gps_employee <- left_join(gps, employee_data,
by = c("id" = "CarID"))
x <- gps_employee %>%
mutate(time = as_hms(Timestamp)) %>%
group_by(id, datestamp) %>%
mutate(count = n()) %>%
ungroup()
group1_gps_employee <- x %>% filter(id >= 1 & id <= 9) %>%
mutate(name = paste(FirstName, LastName, sep =" "))
group2_gps_employee <- x %>% filter(id >= 10 & id <= 18) %>%
mutate(name = paste(FirstName, LastName, sep =" "))
group3_gps_employee <- x %>% filter(id >= 19 & id <= 27)%>%
mutate(name = paste(FirstName, LastName, sep =" "))
group4_gps_employee <- x %>% filter(id >= 28 & id <= 35)%>%
mutate(name = paste(FirstName, LastName, sep =" "))
truck_gps_employee <- x %>% filter(id >= 101 & id <= 107)%>%
mutate(name = paste(FirstName, LastName, sep =" "))
moving_chart1 <- ggplot(group1_gps_employee,
aes(label = lat, label2 = long)) +
facet_wrap(.~name) +
geom_point(aes(time, datestamp)) +
labs(x = "Time", y = "Date", title = "Vehicle Moving Plot") +
theme(axis.text.x = element_text(size = 8, angle = 45, vjust = 1.1, hjust = 1.1),
axis.text.y = element_text(size = 6),
plot.title = element_text(hjust = 0.5))
moving_chart2 <- ggplot(group2_gps_employee,
aes(label = lat, label2 = long)) +
facet_wrap(.~name) +
geom_point(aes(time, datestamp)) +
labs(x = "Time", y = "Date", title = "Vehicle Moving Plot") +
theme(axis.text.x = element_text(size = 8, angle = 45, vjust = 1.1, hjust = 1.1),
axis.text.y = element_text(size = 6),
plot.title = element_text(hjust = 0.5))
moving_chart3 <- ggplot(group3_gps_employee,
aes(label = lat, label2 = long)) +
facet_wrap(.~name) +
geom_point(aes(time, datestamp)) +
labs(x = "Time", y = "Date", title = "Vehicle Moving Plot") +
theme(axis.text.x = element_text(size = 8, angle = 45, vjust = 1.1, hjust = 1.1),
axis.text.y = element_text(size = 6),
plot.title = element_text(hjust = 0.5))
moving_chart4 <- ggplot(group4_gps_employee,
aes(label = lat, label2 = long)) +
facet_wrap(.~name) +
geom_point(aes(time, datestamp)) +
labs(x = "Time", y = "Date", title = "Vehicle Moving Plot") +
theme(axis.text.x = element_text(size = 8, angle = 45, vjust = 1.1, hjust = 1.1),
axis.text.y = element_text(size = 6),
plot.title = element_text(hjust = 0.5))
moving_chart5 <- ggplot(truck_gps_employee,
aes(label = lat, label2 = long)) +
facet_wrap(.~name) +
geom_point(aes(time, datestamp)) +
labs(x = "Time", y = "Date", title = "Vehicle Moving Plot") +
theme(axis.text.x = element_text(size = 8, angle = 45, vjust = 1.1, hjust = 1.1),
axis.text.y = element_text(size = 6),
plot.title = element_text(hjust = 0.5))





From the charts, a couple of anomalies were observed by the following employees:
(to list out)
We will examine them in detail.
gps_employee <- left_join(gps, employee_data,
by = c("id" = "CarID"))
x <- gps_employee %>%
mutate(time = as_hms(Timestamp)) %>%
group_by(id, datestamp) %>%
mutate(count = n()) %>%
ungroup()
detail_group1_gps_employee <- x %>% filter(id == 1 |
id == 16 |
id == 28 |
id == 9 |
id == 8 |
id == 5) %>%
mutate(name = paste(FirstName, LastName, sep =" "))
detail_group2_gps_employee <- x %>% filter(id == 15 |
id == 21 |
id == 26 |
id == 24 |
id == 19 |
id == 29) %>%
mutate(name = paste(FirstName, LastName, sep =" "))
detail_moving_chart1 <- ggplot(detail_group1_gps_employee,
aes(label = lat, label2 = long)) +
facet_wrap(.~name) +
geom_point(aes(time, datestamp)) +
labs(x = "", y = "", title = "Vehicle Moving Plot") +
theme(axis.text.x = element_text(size = 8, angle = 45, vjust = 1.1, hjust = 1.1),
axis.text.y = element_text(size = 6),
plot.title = element_text(hjust = 0.5))
detail_moving_chart2 <- ggplot(detail_group2_gps_employee,
aes(label = lat, label2 = long)) +
facet_wrap(.~name) +
geom_point(aes(time, datestamp)) +
labs(x = "", y = "", title = "Vehicle Moving Plot") +
theme(axis.text.x = element_text(size = 8, angle = 45, vjust = 1.1, hjust = 1.1),
axis.text.y = element_text(size = 6),
plot.title = element_text(hjust = 0.5))
#ggplotly(detail_moving_chart1)
#ggplotly(detail_moving_chart2)
#detail_moving_chart1
#detail_moving_chart2
The above table would be able to infer the credit card holder to the specific loyalty card number. However, there were uncertainties in the data. For instance, credit card transactions timestamp were indicated with date and time while loyalty card transactions were indicated with date only. There may be occasions where the date, price and location may match between both sets of data but they were different transactions for the day.
The transactions involved are:
9 Jan 2014, Katrina’s Café transaction amount of $26.60
9 Jan 2014, Guy’s Gyros transaction amount of $8.23
11 Jan 2014, Hippokampos transaction amount of $63.21
There were 2 transactions at each location bearing the same price at the same date inside loyalty data. However, we were unsure which transaction from each dataset belongs to which. And therefore, these observations would be excluded.
It was observed that:
1286 credit card holder uses 2 loyalty cards that is L3288 and L3572
L6267 loyalty card holder uses 2 credit cards that is 6691 and 6899
We can see that owner of 1286 used two different loyalty cards at various instances.